Least-Squares Policy Iteration: Bias-Variance Trade-off in Control Problems
نویسندگان
چکیده
In the context of large space MDPs with linear value function approximation, we introduce a new approximate version of λ-Policy Iteration (Bertsekas & Ioffe, 1996), a method that generalizes Value Iteration and Policy Iteration with a parameter λ ∈ (0, 1). Our approach, called Least-Squares λ Policy Iteration, generalizes LSPI (Lagoudakis & Parr, 2003) which makes efficient use of training samples compared to classical temporaldifferences methods. The motivation of our work is to exploit the λ parameter within the least-squares context, and without having to generate new samples at each iteration or to know a model of the MDP. We provide a performance bound that shows the soundness of the algorithm. We show empirically on a simple chain problem and on the Tetris game that this λ parameter acts as a bias-variance trade-off that may improve the convergence and the performance of the policy obtained.
منابع مشابه
Adaptive Importance Sampling with Automatic Model Selection in Value Function Approximation
Off-policy reinforcement learning is aimed at efficiently reusing data samples gathered in the past, which is an essential problem for physically grounded AI as experiments are usually prohibitively expensive. A common approach is to use importance sampling techniques for compensating for the bias caused by the difference between data-sampling policies and the target policy. However, existing o...
متن کاملModel-Free Least-Squares Policy Iteration
We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for predict...
متن کاملUniform CR Bound: Implement ation Issues And Applications
We apply a uniform Cramer-Rao (CR) bound [l] to study the bias-variance trade-offs in single photon emission computed tomography (SPECT) image reconstruction. The uniform CR bound is used to specify achievable and unachievable regions in the bias-variance trade-off plane. The image reconstruction algorithms considered in this paper are: 1) Space alternating generalized EM and 2) penalized weigh...
متن کاملLeast-Squares Methods in Reinforcement Learning for Control
Least-squares methods have been successfully used for prediction problems in the context of reinforcement learning, but little has been done in extending these methods to control problems. This paper presents an overview of our research efforts in using least-squares techniques for control. In our early attempts, we considered a direct extension of the Least-Squares Temporal Difference (LSTD) a...
متن کاملAveraged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions
We consider the least-squares regression problem and provide a detailed asymptotic analysis of the performance of averaged constant-step-size stochastic gradient descent. In the strongly-convex case, we provide an asymptotic expansion up to explicit exponentially decaying terms. Our analysis leads to new insights into stochastic approximation algorithms: (a) it gives a tighter bound on the allo...
متن کامل